Why Evaluation is Non-Negotiable
You’ve built an incredible, multi-stage RAG pipeline. Now for the hard truth: it’s failing. In production, RAG systems usually fail not because the LLM is bad, but because the context it receives is subtly wrong. A poor retrieval result, even if ranked fifth, can derail the answer. Without a clear evaluation framework, you’re debugging blind, relying on “vibe checks” that fail in production.

What Problem It Solves: Systematic evaluation isolates failures. It answers: did the system hallucinate because the relevant document wasn’t found (Retrieval Failure), or because the LLM ignored the document that was found (Generation Failure)? This single distinction saves countless hours of wasted effort on prompt tuning when the real fix is adjusting your chunking strategy.

Real-World Example: Teams often spend weeks trying to fix LLM hallucination with better prompts, only to discover the root cause was a weak embedding model missing critical, domain-specific documents. We’ll learn to diagnose the real issue upfront.

The Two-Part Evaluation
The RAG pipeline has two core components that must be evaluated and debugged independently: Retrieval and Generation.

- Retrieval Quality: Did we find the right documents?
- Generation Quality: Did the LLM use them correctly?
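The retrieval-vs-generation split can be made mechanical. Here is a minimal sketch; the names (`RagResult`, `attribute_failure`) and the 0.9 faithfulness cutoff are illustrative assumptions, not part of any framework:

```python
from dataclasses import dataclass

@dataclass
class RagResult:
    query: str
    retrieved_ids: list   # doc IDs returned by the retriever
    relevant_ids: list    # ground-truth doc IDs for this query
    faithfulness: float   # judge score in [0, 1]

def attribute_failure(r: RagResult) -> str:
    """Decide which stage to debug first."""
    if not set(r.relevant_ids) & set(r.retrieved_ids):
        return "retrieval_failure"   # the right doc never reached the LLM
    if r.faithfulness < 0.9:
        return "generation_failure"  # doc was retrieved, but the LLM ignored it
    return "ok"

# The relevant doc d1 was never retrieved, so prompt tuning won't help here:
print(attribute_failure(RagResult("q", ["d3", "d7"], ["d1"], 0.95)))
# retrieval_failure
```

Checking retrieval first is deliberate: a generation-side fix cannot compensate for a document that was never in the context window.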
The ARES Framework for Failure Attribution
We’ll use a simplified version of the ARES framework (Automated RAG Evaluation System) to diagnose every failure based on three component checks:

| Check | Component | Goal | Metric Type |
|---|---|---|---|
| Context Relevance | Retrieval | Did we retrieve the necessary ground-truth document(s)? | Recall@k, NDCG@k |
| Answer Faithfulness | Generation | Is the answer grounded only in the retrieved context? | LLM-as-Judge (Faithfulness) |
| Answer Relevance | Generation | Does the answer address the original query? | LLM-as-Judge (Relevance) |
Part 1: Retrieval Metrics (Did we find it?)
Recall@k: Are the relevant docs in the top-k?

- Definition: What fraction of relevant documents did we retrieve?
- When to use: Measuring if all important docs are findable
- Target: Recall@10 > 0.90 (catch 90%+ relevant docs in first 10)
Precision@k: Are the retrieved docs relevant?

- Definition: What fraction of retrieved documents are actually relevant?
- When to use: Measuring retrieval noise/irrelevance
- Target: Precision@5 > 0.80 (80%+ of top-5 are useful)
NDCG@k: Is the ranking order right?

- Definition: Normalized Discounted Cumulative Gain - considers ranking order.
- Why it matters: Getting relevant doc at position 1 is better than position 10.
- When to use: Measuring overall retrieval quality (position matters)
- Target: NDCG@10 > 0.80 (strong ranking)
MRR: How fast do users reach a relevant doc?

- Definition: The reciprocal rank (1/rank) of the first relevant doc, averaged across queries.
- When to use: User-focused metric (fast discovery matters)
- Target: MRR > 0.80 (relevant doc in top ~1-2 results)
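The four metrics above can be sketched in plain Python. This is a minimal, binary-relevance version (each doc is simply relevant or not), written per-query; in practice you would average each metric over your whole golden dataset:

```python
import math

def hit_rate(retrieved, relevant, k=10):
    """1.0 if any relevant doc appears in the top-k (a simple recall signal)."""
    return float(any(d in relevant for d in retrieved[:k]))

def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k that is relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant doc (0 if none found); average for MRR."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance NDCG: rewards placing relevant docs early."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["d7", "d1", "d9", "d2"]   # ranked retriever output
relevant = {"d1", "d2"}                # ground-truth relevant doc IDs
print(reciprocal_rank(retrieved, relevant))  # first hit at rank 2 -> 0.5
```

Note how the same retrieval run scores differently on each metric: the hit rate is perfect, but the reciprocal rank is only 0.5 because the first result was noise.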
Setup and Metric Calculation
We’ll use hit_rate (a simple form of recall), MRR, Precision, and NDCG. Full runnable example of retrieval metrics.

Part 2: Generation Quality
Now evaluate whether the LLM used the retrieved context correctly. Full runnable example of response evaluation.

Faithfulness: Is the answer grounded in context?
Definition: Does the answer only contain info from the retrieved docs? Target: Faithfulness > 0.90 (less than 10% hallucination).

Relevance: Does the answer address the query?
Definition: Is the answer on-topic and helpful for the question? Target: Relevance > 0.85 (answers are actually helpful).

LLM-as-Judge
Traditional evaluation metrics require labeled ground truth, but for generation quality (faithfulness and relevance) we can leverage LLMs themselves as evaluators. The LLM-as-Judge approach uses a separate, more capable LLM (like GPT-4) to act as an impartial judge that scores whether a response is faithful to the retrieved context and relevant to the original query.

Main advantage: The primary benefit of LLM-as-Judge is that it can be used in production environments where there is no ground truth. Unlike retrieval metrics that require manually labeled relevant documents, or traditional answer evaluation that needs expected answers, LLM-as-Judge can evaluate any query-response pair in real time by comparing the response against the retrieved context. This makes it ideal for continuous monitoring of production RAG systems.

Why it works: Modern LLMs are remarkably good at understanding semantic relationships and detecting inconsistencies. When given a query, a response, and the source context, a judge LLM can reliably determine whether the answer is grounded in the provided documents (faithfulness) and whether it actually addresses what was asked (relevance).

Critical requirement: Before deploying LLM-as-Judge in production, you must evaluate and calibrate the judge itself using ground truth data. Create a validation set with human-annotated examples (e.g., 50-100 cases where you know the correct faithfulness/relevance scores), run the judge on these cases, and measure its accuracy against the human labels. If the judge’s agreement with the human labels falls below roughly 85%, adjust the judge model, prompts, or temperature settings. This calibration step ensures your judge is actually judging correctly before you rely on it for production monitoring.

Trade-offs: While faster and more scalable than human evaluation, LLM-as-Judge has costs (API calls to the judge model) and can occasionally disagree with human annotators on edge cases.
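A minimal judge sketch, including the calibration check described above. `call_llm` is a placeholder for whatever client you use (e.g. an OpenAI or Anthropic SDK call), and the prompt wording and binary 0/1 scoring scheme are illustrative assumptions, not a standard:

```python
FAITHFULNESS_PROMPT = """You are an impartial judge.
Context:
{context}

Answer:
{answer}

Does the answer contain ONLY information supported by the context?
Reply with a single number: 1 if fully grounded, 0 otherwise."""

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to your judge model's API client.
    raise NotImplementedError

def judge_faithfulness(context: str, answer: str, llm=call_llm) -> float:
    """Score one response; expects the judge to reply with '1' or '0'."""
    reply = llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return float(reply.strip().split()[0])

def judge_agreement(cases, llm):
    """Calibration: fraction of human-labeled cases the judge agrees with.

    cases: list of (context, answer, human_label) tuples.
    """
    hits = sum(judge_faithfulness(c, a, llm) == label for c, a, label in cases)
    return hits / len(cases)
```

Run `judge_agreement` on your human-annotated validation set first; only promote the judge to production monitoring once agreement clears your threshold (the text suggests about 85%).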
Use it for continuous monitoring and initial evaluation, but validate critical failures with human review.

Building a Golden Dataset
The most important step: you need ground-truth test cases.

How to Create a Golden Dataset
- Size: 50-100 queries minimum, 500+ ideal
- Diversity: Cover all query types (factual, conceptual, multi-hop)
- Difficulty: Include easy and hard cases
- Updates: Refresh quarterly as corpus changes
- Multiple annotators: Inter-annotator agreement > 0.80
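One possible shape for a single golden-dataset entry, stored one JSON object per line (JSONL). The field names and example values here are illustrative assumptions, not a standard schema:

```python
import json

golden_case = {
    "query": "What is the refund window for annual plans?",
    "query_type": "factual",                    # factual | conceptual | multi-hop
    "relevant_doc_ids": ["billing-policy-v3"],  # ground truth for Recall@k / Precision@k
    "expected_answer": "30 days from purchase.",
    "annotators": ["annotator_a", "annotator_b"],  # for inter-annotator agreement
}

# One line per case makes quarterly refreshes a simple append/replace:
line = json.dumps(golden_case)
```

Storing `relevant_doc_ids` and `expected_answer` separately is what lets you score retrieval and generation independently against the same case.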
Debugging RAG Failures
When evaluation reveals problems, debug systematically: check retrieval first (was the right document found?), then generation (was it used faithfully?).

Continuous Evaluation
- Log 100% of queries (for debugging)
- Evaluate 10% of queries (cost management)
- Full golden dataset evaluation weekly
- Alert on >10% metric drops
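The policy above can be sketched in a few lines. The constants mirror the bullets (10% evaluation sample, alert on a >10% drop); the function names are illustrative:

```python
import random

EVAL_SAMPLE_RATE = 0.10  # evaluate 10% of live queries (cost management)
ALERT_DROP = 0.10        # alert on a >10% relative metric drop

def should_evaluate(rng=random.random):
    """Log every query; run the (costly) judge on a 10% sample."""
    return rng() < EVAL_SAMPLE_RATE

def check_for_regression(baseline: float, current: float) -> bool:
    """True if the metric dropped more than 10% relative to its baseline."""
    return baseline > 0 and (baseline - current) / baseline > ALERT_DROP

# Weekly golden-dataset run: faithfulness fell 0.90 -> 0.78, a ~13% drop.
print(check_for_regression(0.90, 0.78))  # True -> fire an alert
```

Using a relative (not absolute) drop keeps the alert meaningful whether your baseline faithfulness is 0.95 or 0.80.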